DRR Research beyond COTS OCR Software: A Survey
Abstract
After decades of research, Optical Character Recognition (OCR) has reached a relatively mature stage. Commercial off-the-shelf (COTS) OCR software packages have become powerful tools in Document Recognition and Retrieval (DRR) applications. One question naturally arises: What areas are left for new DRR research beyond COTS OCR software? This question has been widely discussed at recent conferences. This paper attempts to address it through a systematic survey of recently reported DRR projects as well as our own Digital Content Re-Mastering (DCRM) research at HP Labs. The survey shows that custom DRR research is still much needed for better accuracy and reliability, complementary content, and downstream information retrieval. Several concrete observations are also made on the basis of this survey: First, basic character/word recognition is mostly handled by COTS software, with a few exceptions. Second, system-level research on reliability and guaranteed accuracy can seldom be replaced by COTS software. Third, document-level structure understanding still has much room to expand. Fourth, post-OCR information retrieval also offers many challenging research topics.
Related resources
The OCRopus open source OCR system
OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.
DRR is a teenager
The fifteenth anniversary of the first SPIE symposium (titled Character Recognition Technologies) on Document Recognition and Retrieval provides an opportunity to examine DRR’s contributions to the development of document technologies. Many of the tools taken for granted today, including workable general purpose OCR, large-scale, semi-automatic forms processing, inter-format table conversion, a...
The impact of running headers and footers on proximity searching
Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve retrievability for such searches.
Software tools and test data for research and testing of page-reading OCR systems
We announce the availability of the UNLV/ISRI Analytic Tools for OCR Evaluation together with a large and diverse collection of scanned document images with the associated ground-truth text. This combination of tools and test data will allow anyone to conduct a meaningful test comparing the performance of competing page-reading algorithms. The value of this collection of software tools and test...
Do Thesauri enhance rule-based categorization for OCR text?
A rule-based automatic text categorizer was tested to see if two types of thesaurus expansion, called query expansion and Junker expansion respectively, would improve categorization. The thesauri used were domain-specific to an OCR test collection focused on a single topic. Results show that neither type of expansion significantly improved categorization.